{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Image Deduplication with FiftyOne\n",
    "\n",
    "This recipe demonstrates a simple use case of using FiftyOne to detect and\n",
    "remove duplicate images from your dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Requirements\n",
    "\n",
    "This notebook requires the `tensorflow` package:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install tensorflow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the data\n",
    "\n",
    "First we download the dataset to disk. The dataset is a 1000 sample subset of\n",
    "CIFAR-100, a dataset of 32x32 pixel images with one of 100 different\n",
    "classification labels such as `apple`, `bicycle`, `porcupine`, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading dataset of 1000 samples to:\n",
      "\t/tmp/fiftyone/cifar100_with_duplicates\n",
      "and corrupting the data (5% duplicates)\n",
      "Download successful\n"
     ]
    }
   ],
   "source": [
    "from image_deduplication_helpers import download_dataset\n",
    "\n",
    "download_dataset()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above script uses `tensorflow.keras.datasets` to download the dataset, so\n",
    "you must have [TensorFlow installed](https://www.tensorflow.org/install).\n",
    "\n",
    "The dataset is organized on disk as follows:\n",
    "\n",
    "```\n",
    "/tmp/fiftyone/\n",
    "└── cifar100_with_duplicates/\n",
    "    ├── <classA>/\n",
    "    │   ├── <image1>.jpg\n",
    "    │   ├── <image2>.jpg\n",
    "    │   └── ...\n",
    "    ├── <classB>/\n",
    "    │   ├── <image1>.jpg\n",
    "    │   ├── <image2>.jpg\n",
    "    │   └── ...\n",
    "    └── ...\n",
    "```\n",
    "\n",
    "As we will soon come to discover, some of these samples are duplicates and we\n",
    "have no clue which they are!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a dataset\n",
    "\n",
    "Let's start by importing the FiftyOne library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import fiftyone as fo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use a utililty method provided by FiftyOne to load the image\n",
    "classification dataset from disk:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 100% |█████| 1000/1000 [529.4ms elapsed, 0s remaining, 1.9K samples/s]      \n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "dataset_name = \"cifar100_with_duplicates\"\n",
    "dataset_dir = os.path.join(\"/tmp/fiftyone\", dataset_name)\n",
    "\n",
    "dataset = fo.Dataset.from_dir(\n",
    "    dataset_dir,\n",
    "    fo.types.ImageClassificationDirectoryTree,\n",
    "    name=dataset_name\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore the dataset\n",
    "\n",
    "We can poke around in the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Name:           cifar100_with_duplicates\n",
      "Persistent:     False\n",
      "Num samples:    1000\n",
      "Tags:           []\n",
      "Sample fields:\n",
      "    filepath:     fiftyone.core.fields.StringField\n",
      "    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n",
      "    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n"
     ]
    }
   ],
   "source": [
    "# Print summary information about the dataset\n",
    "print(dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Sample: {\n",
      "    'dataset_name': 'cifar100_with_duplicates',\n",
      "    'id': '5f21e95b47c8f4cc5552aa08',\n",
      "    'filepath': '/tmp/fiftyone/cifar100_with_duplicates/apple/113.jpg',\n",
      "    'tags': BaseList([]),\n",
      "    'metadata': None,\n",
      "    'ground_truth': <Classification: {'label': 'apple', 'confidence': None, 'logits': None}>,\n",
      "}>\n"
     ]
    }
   ],
   "source": [
    "# Print a sample\n",
    "print(dataset.first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a view that contains only samples whose ground truth label is\n",
    "`mountain`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset:        cifar100_with_duplicates\n",
      "Num samples:    7\n",
      "Tags:           []\n",
      "Sample fields:\n",
      "    filepath:     fiftyone.core.fields.StringField\n",
      "    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n",
      "    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n",
      "Pipeline stages:\n",
      "    1. Match(filter={'$expr': {'$eq': [...]}})\n"
     ]
    }
   ],
   "source": [
    "# Used to write view expressions that involve sample fields\n",
    "from fiftyone import ViewField as F\n",
    "\n",
    "view = dataset.match(F(\"ground_truth.label\") == \"mountain\")\n",
    "\n",
    "# Print summary information about the view\n",
    "print(view)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Sample: {\n",
      "    'dataset_name': 'cifar100_with_duplicates',\n",
      "    'id': '5f21e95b47c8f4cc5552ac04',\n",
      "    'filepath': '/tmp/fiftyone/cifar100_with_duplicates/mountain/0.jpg',\n",
      "    'tags': BaseList([]),\n",
      "    'metadata': None,\n",
      "    'ground_truth': <Classification: {'label': 'mountain', 'confidence': None, 'logits': None}>,\n",
      "}>\n"
     ]
    }
   ],
   "source": [
    "# Print the first sample in the view\n",
    "print(view.first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a view with samples sorted by their ground truth labels in reverse\n",
    "alphabetical order:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset:        cifar100_with_duplicates\n",
      "Num samples:    1000\n",
      "Tags:           []\n",
      "Sample fields:\n",
      "    filepath:     fiftyone.core.fields.StringField\n",
      "    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n",
      "    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n",
      "Pipeline stages:\n",
      "    1. SortBy(field_or_expr='ground_truth.label', reverse=True)\n"
     ]
    }
   ],
   "source": [
    "view = dataset.sort_by(\"ground_truth.label\", reverse=True)\n",
    "\n",
    "# Print summary information about the view\n",
    "print(view)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Sample: {\n",
      "    'dataset_name': 'cifar100_with_duplicates',\n",
      "    'id': '5f21e95b47c8f4cc5552ade2',\n",
      "    'filepath': '/tmp/fiftyone/cifar100_with_duplicates/worm/167.jpg',\n",
      "    'tags': BaseList([]),\n",
      "    'metadata': None,\n",
      "    'ground_truth': <Classification: {'label': 'worm', 'confidence': None, 'logits': None}>,\n",
      "}>\n"
     ]
    }
   ],
   "source": [
    "# Print the first sample in the view\n",
    "print(view.first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize the dataset\n",
    "\n",
    "Start browsing the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "App launched\n"
     ]
    }
   ],
   "source": [
    "session = fo.launch_app(dataset=dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![dataset](images/dedup_1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Narrow your scope to 10 random samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "session.view = dataset.take(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![take](images/dedup_2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Click on some some samples in the GUI to select them and access their IDs from\n",
    "code!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the IDs of the currently selected samples in the App\n",
    "sample_ids = session.selected"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a view that contains your currently selected samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "selected_view = dataset.select(session.selected)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Update the App to only show your selected samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "session.view = selected_view"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![selected](images/dedup_3.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compute file hashes\n",
    "\n",
    "Iterate over the samples and compute their file hashes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Name:           cifar100_with_duplicates\n",
      "Persistent:     False\n",
      "Num samples:    1000\n",
      "Tags:           []\n",
      "Sample fields:\n",
      "    filepath:     fiftyone.core.fields.StringField\n",
      "    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n",
      "    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n",
      "    file_hash:    fiftyone.core.fields.IntField\n"
     ]
    }
   ],
   "source": [
    "import fiftyone.core.utils as fou\n",
    "\n",
    "for sample in dataset:\n",
    "    sample[\"file_hash\"] = fou.compute_filehash(sample.filepath)\n",
    "    sample.save()\n",
    "\n",
    "print(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have two ways to visualize this new information.\n",
    "\n",
    "First, you can view the sample from your Terminal:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Sample: {\n",
      "    'dataset_name': 'cifar100_with_duplicates',\n",
      "    'id': '5f21e95b47c8f4cc5552aa08',\n",
      "    'filepath': '/tmp/fiftyone/cifar100_with_duplicates/apple/113.jpg',\n",
      "    'tags': BaseList([]),\n",
      "    'metadata': None,\n",
      "    'ground_truth': <Classification: {'label': 'apple', 'confidence': None, 'logits': None}>,\n",
      "    'file_hash': -282626705163262820,\n",
      "}>\n"
     ]
    }
   ],
   "source": [
    "sample = dataset.first()\n",
    "print(sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or you can refresh the App and toggle on the new `file_hash` field:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "session.dataset = dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![dataset2](images/dedup_4.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check for duplicates\n",
    "\n",
    "Now let's use a simple Python statement to locate the duplicate files in the\n",
    "dataset, i.e., those with the same file hashses:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of duplicate file hashes: 45\n"
     ]
    }
   ],
   "source": [
    "from collections import Counter\n",
    "\n",
    "filehash_counts = Counter(sample.file_hash for sample in dataset)\n",
    "dup_filehashes = [k for k, v in filehash_counts.items() if v > 1]\n",
    "\n",
    "print(\"Number of duplicate file hashes: %d\" % len(dup_filehashes))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's create a view that contains only the samples with these duplicate\n",
    "file hashes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of images that have a duplicate: 91\n",
      "Number of duplicates: 46\n"
     ]
    }
   ],
   "source": [
    "dup_view = (dataset\n",
    "    # Extract samples with duplicate file hashes\n",
    "    .match(F(\"file_hash\").is_in(dup_filehashes))\n",
    "    # Sort by file hash so duplicates will be adjacent\n",
    "    .sort_by(\"file_hash\")\n",
    ")\n",
    "\n",
    "print(\"Number of images that have a duplicate: %d\" % len(dup_view))\n",
    "print(\"Number of duplicates: %d\" % (len(dup_view) - len(dup_filehashes)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course, we can always use the App to visualize our work!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "session.view = dup_view"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![dup-view](images/dedup_5.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Delete duplicates\n",
    "\n",
    "Now let's delete the duplicate samples from the dataset using our `dup_view` to\n",
    "restrict our attention to known duplicates:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Length of dataset before: 1000\n",
      "Length of dataset after: 954\n",
      "Number of unique file hashes: 954\n"
     ]
    }
   ],
   "source": [
    "print(\"Length of dataset before: %d\" % len(dataset))\n",
    "\n",
    "_dup_filehashes = set()\n",
    "for sample in dup_view:\n",
    "    if sample.file_hash not in _dup_filehashes:\n",
    "        _dup_filehashes.add(sample.file_hash)\n",
    "        continue\n",
    "\n",
    "    del dataset[sample.id]\n",
    "\n",
    "print(\"Length of dataset after: %d\" % len(dataset))\n",
    "\n",
    "# Verify that the dataset no longer contains any duplicates\n",
    "print(\"Number of unique file hashes: %d\" % len({s.file_hash for s in dataset}))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export the deduplicated dataset\n",
    "\n",
    "Finally, let's export a fresh copy of our now-duplicate-free dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 100% |███████| 954/954 [808.2ms elapsed, 0s remaining, 1.2K samples/s]       \n"
     ]
    }
   ],
   "source": [
    "EXPORT_DIR = \"/tmp/fiftyone/image-deduplication\"\n",
    "\n",
    "dataset.export(label_field=\"ground_truth\", export_dir=EXPORT_DIR)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check out the contents of `/tmp/fiftyone/image-deduplication` on disk to see how the data is\n",
    "organized.\n",
    "\n",
    "You can load the deduplicated dataset that you exported back into FiftyOne at\n",
    "any time as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 100% |███████| 954/954 [547.2ms elapsed, 0s remaining, 1.7K samples/s]      \n",
      "Name:           no_duplicates\n",
      "Persistent:     False\n",
      "Num samples:    954\n",
      "Tags:           []\n",
      "Sample fields:\n",
      "    filepath:     fiftyone.core.fields.StringField\n",
      "    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)\n",
      "    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)\n",
      "    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)\n"
     ]
    }
   ],
   "source": [
    "no_dups_dataset = fo.Dataset.from_dir(\n",
    "    EXPORT_DIR,\n",
    "    fo.types.FiftyOneImageClassificationDataset,\n",
    "    name=\"no_duplicates\",\n",
    ")\n",
    "\n",
    "print(no_dups_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleanup\n",
    "\n",
    "You can cleanup the files generated by this recipe by running:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "!rm -rf /tmp/fiftyone"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
